{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/airtrain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AirtrainAI Cookbook\n",
    "\n",
    "[Airtrain](https://www.airtrain.ai/) is a tool supporting unstructured/low-structured text datasets. It allows automated clustering, document classification, and more.\n",
    "\n",
    "This cookbook showcases how to ingest and transform/enrich data with LlamaIndex and then upload the data to Airtrain for further processing and exploration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation & Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install some libraries we'll use for our examples. These\n",
    "# are not required to use Airtrain with LlamaIndex, and are just\n",
    "# there to help us illustrate use.\n",
    "%pip install llama-index-embeddings-openai==0.2.4\n",
    "%pip install llama-index-readers-web==0.2.2\n",
    "%pip install llama-index-readers-github==0.2.0\n",
    "\n",
    "# Install Airtrain SDK with LlamaIndex integration\n",
    "%pip install airtrain-py[llama-index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Running async code in a notebook requires using nest_asyncio, and we will\n",
    "# use some async examples. So we will set up nest_asyncio here. Outside\n",
    "# an async context or outside a notebook, this step is not required.\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### API Key Setup\n",
    "\n",
    "Set up the API keys that will be required to run the examples that follow.\n",
    "The GitHub API token and OpenAI API key are only required for the example\n",
    "'Usage with Readers/Embeddings/Splitters'. Instructions for getting a GitHub\n",
    "access token are\n",
    "[here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)\n",
    "while an OpenAI API key can be obtained\n",
    "[here](https://platform.openai.com/api-keys).\n",
    "\n",
    "To obtain your Airtrain API Key:\n",
    "- Create an Airtrain Account by visting [here](https://app.airtrain.ai/api/auth/login)\n",
    "- View \"Settings\" in the lower left, then go to \"Billing\" to sign up for a pro account or start a trial\n",
    "- Copy your API key from the \"Airtrain API Key\" tab in \"Billing\"\n",
    "\n",
    "Note that the Airtrain trial only allows ONE dataset at a time. As this notebook creates many, you may need\n",
    "to delete the dataset in the Airtrain UI as you go along, to make space for another one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"GITHUB_TOKEN\"] = \"<your GitHub token>\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<your OpenAi API key>\"\n",
    "\n",
    "os.environ[\"AIRTRAIN_API_KEY\"] = \"<your Airtrain API key>\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1: Usage with Readers/Embeddings/Splitters\n",
    "\n",
    "Some of the core abstractions in LlamaIndex are [Documents and Nodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/).\n",
    "Airtrain's LlamaIndex integration allows you to create an Airtrain dataset using any iterable collection of either of these, via the\n",
    "`upload_from_llama_nodes` function.\n",
    "\n",
    "To illustrate the flexibility of this, we'll do both:\n",
    "1. Create a dataset directly of documents. In this case whole pages from the [Sematic](https://docs.sematic.dev/) docs.\n",
    "2. Use OpenAI embeddings and the `SemanticSplitterNodeParser` to split those documents into nodes, and create a dataset from those."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import airtrain as at\n",
    "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.readers.github import GithubRepositoryReader, GithubClient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to set up our reader. In this case we're using the GitHub reader, but that's just for illustrative purposes. Airtrain can ingest documents no matter what reader they came from originally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "github_token = os.environ.get(\"GITHUB_TOKEN\")\n",
    "github_client = GithubClient(github_token=github_token, verbose=True)\n",
    "reader = GithubRepositoryReader(\n",
    "    github_client=github_client,\n",
    "    owner=\"sematic-ai\",\n",
    "    repo=\"sematic\",\n",
    "    use_parser=False,\n",
    "    verbose=False,\n",
    "    filter_directories=(\n",
    "        [\"docs\"],\n",
    "        GithubRepositoryReader.FilterType.INCLUDE,\n",
    "    ),\n",
    "    filter_file_extensions=(\n",
    "        [\n",
    "            \".md\",\n",
    "        ],\n",
    "        GithubRepositoryReader.FilterType.INCLUDE,\n",
    "    ),\n",
    ")\n",
    "read_kwargs = dict(branch=\"main\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the documents with the reader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = reader.load_data(**read_kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create dataset directly from documents\n",
    "\n",
    "You can create an Airtrain dataset directly from these documents without doing any further\n",
    "processing. In this case, Airtrain will automatically embed the documents for you before\n",
    "generating further insights. Each row in the dataset will represent an entire markdown\n",
    "document. Airtrain will automatically provide insights like semantic clustering of your\n",
    "documents, allowing you to browse through the documents by looking at ones that cover similar\n",
    "topics or uncovering subsets of documents that you might want to remove.\n",
    "\n",
    "Though additional processing beyond basic document retrieval is not *required*, it is\n",
    "*allowed*. You can enrich the documents with metadata, filter them, or manipulate them\n",
    "in any way you like before uploading to Airtrain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploaded 42 rows to 'Sematic Docs Dataset: Whole Documents'. View at: https://app.airtrain.ai/dataset/7fd09dca-81b9-42b8-acc9-01ce08302b16\n"
     ]
    }
   ],
   "source": [
    "result = at.upload_from_llama_nodes(\n",
    "    documents,\n",
    "    name=\"Sematic Docs Dataset: Whole Documents\",\n",
    ")\n",
    "print(f\"Uploaded {result.size} rows to '{result.name}'. View at: {result.url}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create dataset after splitting and embedding\n",
    "\n",
    "If you wish to view a dataset oriented towards nodes within documents rather than whole documents, you can do that as well.\n",
    "Airtrain will automatically create insights like a 2d PCA projection of your embedding vectors, so you can visually explore\n",
    "the embedding space from which your RAG nodes will be retrieved. You can also click on individual rows and look at the ones\n",
    "that are nearest to it in the full n-dimensional embedding space, to drill down further. Automated clusters and other insights\n",
    "will also be generated to enrich and aid your exploration.\n",
    "\n",
    "Here we'll use OpenAI embeddings and a `SemanticSplitterNodeParser` splitter, but you can use any other LlamaIndex tooling you\n",
    "like to process your nodes before uploading to Airtrain. You can even skip embedding them yourself entirely, in which case\n",
    "Airtrain will embed the nodes for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = OpenAIEmbedding()\n",
    "splitter = SemanticSplitterNodeParser(\n",
    "    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model\n",
    ")\n",
    "nodes = splitter.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "⚠️ **Note** ⚠️: If you are on an Airtrain trial and already created a whole-document dataset, you will need to delete it before uploading a new dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploaded 137 rows to Sematic Docs, split + embedded. View at: https://app.airtrain.ai/dataset/ebec9bcc-6ed8-4165-a0de-29bef740c70b\n"
     ]
    }
   ],
   "source": [
    "result = at.upload_from_llama_nodes(\n",
    "    nodes,\n",
    "    name=\"Sematic Docs, split + embedded\",\n",
    ")\n",
    "print(f\"Uploaded {result.size} rows to {result.name}. View at: {result.url}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 2: Using the [Workflow](https://docs.llamaindex.ai/en/stable/module_guides/workflow/#workflows) API\n",
    "\n",
    "Since documents and nodes are the core abstractions the Airtrain integration works with, and these abstractions are\n",
    "shared in LlamaIndex's workflows API, you can also use Airtrain as part of a broader workflow. Here we will illustrate\n",
    "usage by scraping a few [Hacker News](https://news.ycombinator.com/) comment threads, but again you are not restricted\n",
    "to web scraping workflows; any workflow producing documents or nodes will do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "from llama_index.core.schema import Node\n",
    "from llama_index.core.workflow import (\n",
    "    Context,\n",
    "    Event,\n",
    "    StartEvent,\n",
    "    StopEvent,\n",
    "    Workflow,\n",
    "    step,\n",
    ")\n",
    "from llama_index.readers.web import AsyncWebPageReader\n",
    "\n",
    "from airtrain import DatasetMetadata, upload_from_llama_nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify the comment threads we'll be scraping from. The particular ones in this example were on or near the front page on September 30th, 2024. If\n",
    "you wish to ingest from pages besides Hacker News, be aware that some sites have their content rendered client-side, in which case you might\n",
    "want to use a reader like the `WholeSiteReader`, which uses a headless Chrome driver to render the page before returning the documents. For here\n",
    "we'll use a page with server-side rendered HTML for simplicity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "URLS = [\n",
    "    \"https://news.ycombinator.com/item?id=41694044\",\n",
    "    \"https://news.ycombinator.com/item?id=41696046\",\n",
    "    \"https://news.ycombinator.com/item?id=41693087\",\n",
    "    \"https://news.ycombinator.com/item?id=41695756\",\n",
    "    \"https://news.ycombinator.com/item?id=41666269\",\n",
    "    \"https://news.ycombinator.com/item?id=41697137\",\n",
    "    \"https://news.ycombinator.com/item?id=41695840\",\n",
    "    \"https://news.ycombinator.com/item?id=41694712\",\n",
    "    \"https://news.ycombinator.com/item?id=41690302\",\n",
    "    \"https://news.ycombinator.com/item?id=41695076\",\n",
    "    \"https://news.ycombinator.com/item?id=41669747\",\n",
    "    \"https://news.ycombinator.com/item?id=41694504\",\n",
    "    \"https://news.ycombinator.com/item?id=41697032\",\n",
    "    \"https://news.ycombinator.com/item?id=41694025\",\n",
    "    \"https://news.ycombinator.com/item?id=41652935\",\n",
    "    \"https://news.ycombinator.com/item?id=41693979\",\n",
    "    \"https://news.ycombinator.com/item?id=41696236\",\n",
    "    \"https://news.ycombinator.com/item?id=41696434\",\n",
    "    \"https://news.ycombinator.com/item?id=41688469\",\n",
    "    \"https://news.ycombinator.com/item?id=41646782\",\n",
    "    \"https://news.ycombinator.com/item?id=41689332\",\n",
    "    \"https://news.ycombinator.com/item?id=41688018\",\n",
    "    \"https://news.ycombinator.com/item?id=41668896\",\n",
    "    \"https://news.ycombinator.com/item?id=41690087\",\n",
    "    \"https://news.ycombinator.com/item?id=41679497\",\n",
    "    \"https://news.ycombinator.com/item?id=41687739\",\n",
    "    \"https://news.ycombinator.com/item?id=41686722\",\n",
    "    \"https://news.ycombinator.com/item?id=41689138\",\n",
    "    \"https://news.ycombinator.com/item?id=41691530\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll define a basic event, as events are the standard way to pass data between steps in LlamaIndex workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CompletedDocumentRetrievalEvent(Event):\n",
    "    name: str\n",
    "    documents: list[Node]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that we'll define the workflow itself. In our case, this will have one step to ingest the documents from the web, one to ingest them to Airtrain, and one to wrap up the workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class IngestToAirtrainWorkflow(Workflow):\n",
    "    @step\n",
    "    async def ingest_documents(\n",
    "        self, ctx: Context, ev: StartEvent\n",
    "    ) -> CompletedDocumentRetrievalEvent | None:\n",
    "        if not ev.get(\"urls\"):\n",
    "            return None\n",
    "        reader = AsyncWebPageReader(html_to_text=True)\n",
    "        documents = await reader.aload_data(urls=ev.get(\"urls\"))\n",
    "        return CompletedDocumentRetrievalEvent(\n",
    "            name=ev.get(\"name\"), documents=documents\n",
    "        )\n",
    "\n",
    "    @step\n",
    "    async def ingest_documents_to_airtrain(\n",
    "        self, ctx: Context, ev: CompletedDocumentRetrievalEvent\n",
    "    ) -> StopEvent | None:\n",
    "        dataset_meta = upload_from_llama_nodes(ev.documents, name=ev.name)\n",
    "        return StopEvent(result=dataset_meta)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the workflow API treats async code as a first-class citizen, we'll define an async `main` to drive the workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def main() -> None:\n",
    "    workflow = IngestToAirtrainWorkflow()\n",
    "    result = await workflow.run(\n",
    "        name=\"My HN Discussions Dataset\",\n",
    "        urls=URLS,\n",
    "    )\n",
    "    print(\n",
    "        f\"Uploaded {result.size} rows to {result.name}. View at: {result.url}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we'll execute the async main using an asyncio event loop.\n",
    "\n",
    "⚠️ **Note** ⚠️: If you are on an Airtrain trial and already ran examples above,\n",
    "you will need to delete the resulting dataset(s) before uploading a new one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "error fetching page from https://news.ycombinator.com/item?id=41693087\n",
      "error fetching page from https://news.ycombinator.com/item?id=41666269\n",
      "error fetching page from https://news.ycombinator.com/item?id=41697137\n",
      "error fetching page from https://news.ycombinator.com/item?id=41697032\n",
      "error fetching page from https://news.ycombinator.com/item?id=41652935\n",
      "error fetching page from https://news.ycombinator.com/item?id=41696434\n",
      "error fetching page from https://news.ycombinator.com/item?id=41688469\n",
      "error fetching page from https://news.ycombinator.com/item?id=41646782\n",
      "error fetching page from https://news.ycombinator.com/item?id=41668896\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Uploaded 20 rows to My HN Discussions Dataset. View at: https://app.airtrain.ai/dataset/bd330f0a-6ff1-4e51-9fe2-9900a1a42308\n"
     ]
    }
   ],
   "source": [
    "asyncio.run(main())  # actually run the main & the workflow"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
